CPSC 545/445 (Autumn 2003) - Class 5: Pairwise Sequence Alignment
Module 2, Part 1


Module 2: Sequence Alignment

Note: Biological sequences evolve
	(dna: single base substitution, deletion, insertion, ...)

Problem: Test two sequences (dna, protein, rna) for similarities 
=> pairwise sequence alignment (psa)

Applications:
- phylogenetic tree reconstruction 
- find functional similarities
  (similar sequences => similar structure => similar function)

Need:
- definition of type of alignment (dna, rna, protein; two or more sequences)
- score for evaluating alignments


---
2.1 Global Pairwise Sequence Alignment Problem:

Given: x = x[1] x[2] ... x[n]
	and y = y[1] y[2] ... y[m]
Find an optimal alignment of x and y,
	i.e. an alignment with maximal score

simple case: n=m, assume only base substitutions (no gaps)
score = sum of scores s(x[i],y[i]) - these are defined by scoring matrix
	(e.g., BLOSUM or PAM matrix)
no maximisation to be done!

more realistic: allow gaps (to model effects of deletions and insertions)
problem: where did deletions and insertions occur most likely,
	i.e., where to place the gaps?

definition: a pairwise Sequence alignment of x,y 
	is pair of sequences x',y' possibly containing gaps "-" with:
	- x' with gaps remove = x, y' with gaps removed = y
	- length(x') = length(y')
	- x'[i] = y'[i] = gap never happens

the score of an alignment x',y' is defined as the sum
of scores over all non-gap aligned pairs (x'[i],y'[i])
+ scores for regions containing gaps
(additive score)

---
gap scoring:

gaps are penalised; gamma(g) = score (penalty) for gap of length g

linear gap score: for gap of length g, gamma(g)=-gd  
	(d = gap penalty constant, 
	typically = 8 in practice for protein psa) 

affine gap score: for gap of length g, gamma(g)=-d-(g-1)e
	(d = gap-open penalty, e = gap-extension penalty, d > e;
	typically d=12, e=2 in practice for protein psa)

motivation for affine gap scores:
- gaps are often longer than one residue
- mutation can insert more than one residues at once
- matching cDNA to genomic DNA (introns/exons)

general gap score: arbitrary function gamma(g)


(scoring matrices and gap penalties can be derived from and
related to probabilistic models of sequence evolution; details later)


---
2.2 finding optimal psa (using linear gap scores)

brute-force-approach:
- enumerate & score all alignments
  problem: (2n choose n) ~= 2^2n / sqrt(2 pi n) many! (exponential in n)
=> infeasible for realistic sequence lengths

much better:
dynamic programming approach (Needleman-Wunsch algorithm):

idea: build up optimal alignments based on optimal alignments
	for smaller subsequences

recursively build 2-dimensional matrix F, 
where F(i,j) = score of best alignment between x[1..i] and y[1..j] 

F(0,0)=0
F(i,0)=-id		% x aligned to all gaps
F(0,j)=-jd		% y aligned to all gaps
F(i,j) = max { F(i-1,j-1) + s(x[i], y[i]), 	% align x[i] with y[i]
		F(i-1,j) - d, 			% align x[i] with gap
		F(i,j-1) - d } 			% align y[i] with gap

fill matrix from top left to bottom right,
store pointers according to the term in the max above that
achieved the maximum

=> F(m,n) = score of optimal alignment
	the alignment itself is constructed by tracing back the pointers
	from cell F(m,n) to F(0,0)

(this algorithm works because score = sum of independent pieces)

complexity: O(m*n) time and memory
(memory complexity can be reduced to O(m+n) with only minor increase
  in computation time using dynamic programming + divide and conquer
  approach; see [BSA], Ch.2.6)

---
2.3 local pairwise sequence alignment

Often, one wants to align subsequences of two larger sequences. 

Examples:
- locating common domains in proteins (e.g., transmembrane proteins)
- comparing extended sections of genomic DNA sequence
- detecting similarity between highly diverged sequences
  (only certain parts of sequence conserved enough for alignment,
  rest may have accumulated too much noise)

local alignments are generally not subsets of global alignments!

naīve approach: 
- pick each possible pair of subsequences from x,y. 
  use global alignment algorithm to align those. 
  -> O(m^2 * n^2) possible subsequences of length O(m), O(n), 
	each global alignment takes time O(m*n)

--
dynamic programming approach (smith-waterman algorithm):

= modification of needlemen-wunsch algorithm,
  complexity: O(m*n) time & space

(again, linear gap scores)

same iterative construction process for dynamic programming matrix F(i,j)
as before, but:

F(i,j) :=  max { 
            F(i-1,j-1)+s(x[i],y[j]) 
            F(i-1,j) - d 
            F(i,j-1) - d 
            0 				% try new subsequence alignment
          } 

The forth case corresponds to starting a new subsequence alignment
at position i in x and j in y, instead of extending the best alignment
up to positions i,j so far.


Modified initialisation:

	F(i,0) := max{0,-id} = 0
	F(0,j) := max{0,-dj} = 0

(starting and ending gaps are no longer penalised, since we are aligning
subsequences only)

The score of optimal alignment is no longer necessarily given
by F(m,n), but can occur at any matrix position.
Likewise, tracing the pointers backwards will typically stop
before reaching F(0,0).


For local alignment to work, the expected score for a random match 
(of unrelated sequenc sections) must be negative.
Otherwise, long matches between unrelated sequences would have high
scores, just based on their length, leading to global or nearly
global alignments.

-> global vs. local alignment behaviour of smith-waterman highly
  dependent on scoring system used.

For ungapped case, theoretical analysis of the requirements
for the expected score of a random match to be negative is possible;
it is based on an unrealistic assumption (all sequence positions
evolve independently) and does not extend to the (more practically
relevant) gapped case. 

Nevertheless, consequences of the theoretical
model in terms of requirements of scoring systems seem to be 
practically relevant.

In practice, balance between local/global behaviour of local psa
is achieved through empirical methods (experimentation, measurements).


---
Resources:

BSA, Ch.2 [good introduction; contains further references]